fix: avx512vl masked load/store#1353
Conversation
fe2938e to
ea882e6
Compare
| if [[ '${{ matrix.sys.flags }}' == 'avx512vl_128' ]]; then | ||
| CMAKE_EXTRA_ARGS="$CMAKE_EXTRA_ARGS -DTARGET_ARCH=skylake-avx512" | ||
| CXXFLAGS="$CXX_FLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128" | ||
| CXXFLAGS="$CXXFLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128" |
There was a problem hiding this comment.
oopsie. Thanks for fixing this one.
serge-sans-paille
left a comment
There was a problem hiding this comment.
I wish we could keep each architecture file unaware from other architectures. I do understand this will disappear once we move to C++17-based architecture, but I'd be happier if you could find another way to apply the constraints.
| // and the VL native as equally specialized for A=avx512vl_*. (bridge_not_vl in fwd.hpp) | ||
| template <class A, bool... Values, class Mode> | ||
| XSIMD_INLINE batch<int32_t, A> load_masked(int32_t const* mem, batch_bool_constant<int32_t, A, Values...>, convert<int32_t>, Mode, requires_arch<A>) noexcept | ||
| XSIMD_INLINE std::enable_if_t<bridge_not_vl<A>::value, batch<int32_t, A>> |
There was a problem hiding this comment.
there should not be any arch-specific code in xsimd_common_memory.hpp. Could you find another approach?
|
|
||
| template <class A> | ||
| XSIMD_INLINE void maskstore(double* mem, batch_bool<double, A> const& mask, batch<double, A> const& src) noexcept | ||
| XSIMD_INLINE void maskstore(double* mem, batch<as_integer_t<double>, A> const& mask, batch<double, A> const& src) noexcept |
There was a problem hiding this comment.
wgy that change? In my mental model, the masked store take a bool mask, not an integer mask
There was a problem hiding this comment.
We _mm256_maskstore_ps takes a __m256i for mask. The bool mask here available in the function calling this utility is backed by a floating point type.
| // single templated implementation for integer masked loads (32/64-bit) | ||
| template <class A, class T, bool... Values, class Mode> | ||
| template <class A, class T, bool... Values, class Mode, | ||
| class = std::enable_if_t<std::is_base_of<avx2, A>::value && !std::is_base_of<avx512vl_256, A>::value>> |
There was a problem hiding this comment.
I quite dislike the fact that xsimd_avx2.hpp needs to know stuff about avx512vl
There was a problem hiding this comment.
I tried, I'll have another look. Without this constraint all compilers work fine except gcc-10 :(
| } | ||
| template <class A, bool... Values, class Mode> | ||
| template <class A, bool... Values, class Mode, | ||
| class = std::enable_if_t<std::is_base_of<avx_128, A>::value && !std::is_base_of<avx512vl_128, A>::value>> |
There was a problem hiding this comment.
same architecture mix reference issue here.
| #if XSIMD_WITH_AVX | ||
| #include "./xsimd_avx.hpp" | ||
| // clang-format off | ||
| // _128 first: avx half-fold recursive call needs avx_128 visible at parse time. |
| * @ingroup architectures | ||
| * | ||
| * AVX512DQ instructions | ||
| * AVX512VL instructions |
| template <typename T, std::size_t N> | ||
| struct make_sized_batch; | ||
| template <typename T, std::size_t N> | ||
| using make_sized_batch_t = typename make_sized_batch<T, N>::type; |
| message(STATUS "Using emulated target: ${TARGET_EMULATED}") | ||
| set(EMULATED_COMPILE_FLAGS -DXSIMD_DEFAULT_ARCH=${TARGET_ARCH};-DXSIMD_WITH_EMULATED=1) | ||
| unset(TARGET_ARCH CACHE) | ||
| elseif (DEFINED XSIMD_DEFAULT_ARCH AND NOT "${XSIMD_DEFAULT_ARCH}" STREQUAL "") |
There was a problem hiding this comment.
Per https://cmake.org/cmake/help/latest/command/if.html#constant I think
if(XSIMD_DEFAULT_ARCH)
is enough
| static_assert(xsimd::all_architectures::contains<xsimd::default_arch>(), "default arch is a valid arch"); | ||
| #else | ||
| namespace xsimd | ||
| { |
|
@DiamonDinoia I run into similar issues of I haven't look at assembly (just making it build with the new updated CI settings). The C++ diff should be fairly small, I wonder if you'd be able to get anything useful from it. I was able to simplify much more the functions from That being said, my solution currently stalls on |
Let CMake force a specific default arch via -DXSIMD_DEFAULT_ARCH (idiomatic if(XSIMD_DEFAULT_ARCH) guard), add a test_arch.cpp check that the forced arch is the default, and fix the linux.yml CXXFLAGS typo.
Split the avx_128 variable swizzle into explicit float/double overloads with a width static_assert, and fix an AVX512DQ -> AVX512VL doc comment.
Add the missing int64/uint64/float/double load_masked overloads and
correct the store_masked batch_bool_constant typing on avx512vl_128 and
avx512vl_256, branching aligned vs unaligned to the right EVEX intrinsic
(vmovdqu{32,64}{k}{z} / vmov{a,u}p{s,d}{k}{z}); unsigned overloads
delegate via bitwise_cast. Resolve the avx/avx2/avx512f half-fold target
through make_sized_batch_t<T, half>::arch_type so a 512-bit masked op
picks the VL arch and emits EVEX instead of VEX vpmaskmov*/vmaskmov*.
ea882e6 to
fa06792
Compare
Drop the cross-arch SFINAE/tag mechanism: a concrete requires_arch<avx512vl_128|256> overload now beats the inherited avx2/avx2_128 one by overload conversion ranking, so no arch file knows about another. xsimd_common_memory.hpp keeps only requires_arch<common> and dispatches on the arch-agnostic trait masked_memory_uses_fp_bitcast (integral with a same-width float register -> reuse that float vmaskmov* path, else a scalar buffer). avx/avx2/avx2_128 drop every is_base_of<avx512vl_*, A> guard; avx2_128 routes native 128-bit integer masked memory through vpmaskmov* (long long* cast for 64-bit) and tags int64/uint64 on avx2_128 (those intrinsics need AVX2). detail::maskstore takes a bool mask and casts internally; xsimd_batch.hpp keeps a make_sized_batch fwd-decl and simplifies the store_masked call; xsimd_isa.hpp documents the _128-first include order; sse2.hpp adapts to the new store_masked(common) signature.
fa06792 to
5a40538
Compare
|
@serge-sans-paille ready for a second round of review. Probably over commented because I chatted with @claude how to best do this in c++14 (No if constexpr...) my solutions where so SFINAE heavy so I asked multiple times how to simplify this and the final outcome is this one. I left the comments in to kind of explain a bit more, happy to trim/clean them after a second round of review. |
xsimd::batch<uint32_t, avx512vl_256>::store(ptr, constexpr_mask, mode)usedto compile to a 6-instruction per-lane scalar extract loop instead of a single
EVEX-encoded masked store, because the call site in
xsimd_batch.hpp:766over-specified template arguments and SFINAE'd away every viable masked-store
overload except the scalar
commonfallback.The load side was unaffected — its call site (
:743) was already correct.Codegen — before (master, commit
7d30b9cc)Load is fine. Store is the scalar fallback: GCC unrolled the 8-lane loop in
xsimd_common_memory.hpp:377into four per-lane 32-bit stores, materialisingeach active lane no mask instruction involved.
Codegen — after
One masked instruction.
Bug fixes
AVX-512VL masked store collapsed to a 6-insn scalar fallback. An over-specified template arg list at the public
batch::store(mask)call site pushed a type into a non-type pack, SFINAE'ing every per-arch overload away. The store now reaches the EVEXvmov*{k}intrinsic.CI
linux.ymltypo silently droppedCXXFLAGSin the VL_128 matrix row ($CXX_FLAGSvs$CXXFLAGS), so the VL_128-default test job was building with stock flags instead of the requested override.avx512vlregister-traits comment misattributed to AVX512DQ.Half-confined masked op fell back to scalar on AVX/AVX2/AVX-512F. The half-fold hardcoded
sse4_2/avx2as the half-width target, but two-phase lookup made the better-arch overload invisible at template-definition time — so the recursive call dispatched to the wrong arch. Fixed by include reorder +make_sized_batch_t<T, half>.Features added
load_masked/store_maskedforavx512vl_128andavx512vl_256covering i32/u32/i64/u64/f32/f64 in both aligned and unaligned modes; partial ordering picks them over the avx2 bridges these archsinherit.
default_archmatchesXSIMD_DEFAULT_ARCHwhen the macro is set — astatic_assertintest_arch.cpp(plus the CMake plumbing to forward the macro) catches default-arch wiring regressions at compile time instead of at runtime.Cleanups
detail::maskstorehelpers now takebatch<>types, symmetric withdetail::maskload.selectuse a typed variable (const batch<T, A> lo = …) instead ofconst auto lo = batch<T, A>{ … }.XSIMD_IF_CONSTEXPR, so the inactive intrinsic isn't instantiated.xsimd_avx.hpp/xsimd_avx2.hpp/xsimd_avx512f.hppusesmake_sized_batch_t<T, half>instead of a hardcoded arch — picksavx_128/avx2_128/avx512vl_256when available, so half-confined stores land on the EVEX or VEX masked intrinsic.xsimd_isa.hppinclude order:_128siblings before their wider arch, VL beforeavx512f.hpp. Required so the recursivestore_masked<half_arch>call sees the better-arch overload at template-definition time.(Changelog by @claude)